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Abstract 



LlPackv2 is a Mathematica package that contains a number of algorithms that 
can be used for the minimization of an ^i-penalizcd least squares functional. The al- 
gorithms can handle a mix of penalized and unpcnalized variables. Several instructive 
examples are given. Also, an implementation that yields an exact output whenever 
exact data are given is provided. 

1 Introduction 

In physical and applied sciences, one often encounters the situation where the quantities in 
which one is ultimately interested cannot be measured directly. Such a situation appears 
e.g. in magnetic resonance imaging (Fourier components of an image are measured), optics 
(convolution of an object with a non-trivial point spread function forms an image) etc. 

Often, it happens that a linear relationship exists between the data and the quan- 
tities of interest. Nonetheless, several issues stand in the way of 'solving' such linear 
inverse problems. Frequently, the data are incomplete (underdetermined system from a 
mathematical point of view) and contaminated with noise (inconsistent equations from a 
mathematical point of view). Moreover, the linear operator linking both is often singular. 

A much-used strategy to arrive at a well-defined solution for the above mentioned 
linear equations, is to formulate the problem as the minimization of a quadratic functional 
consisting of the sum of the discrepancy \\Kx — y\\ 2 and a penalization contribution ||x|| 2 : 




in terms of the noisy data y and the linear operator K. The regularization parameter 
A controls the trade-off between the data misfit y — Kx and the ^2-norm of the model 
parameters. With a proper choice of A, the influence of noise (present in the data y) on 
the reconstruction of x can now be kept under control [H [3] . 
The resulting variational equations 




argmin \\Kx — y\\ 2 + \\\x 




K T Kx + \x = K T y 



(2) 



for the minimization problem (pQ) are linear; a considerable advantage of this technique 
is that existing efficient linear solvers (i.e. thoroughly tested algorithms and computer 
software) can be used. A disadvantage of using a penalty term of type A||x|| 2 is that it is 



quite generic and does not take into account any a priori information that one may have 
about the desired reconstructed object (other than being of finite size). 

Recently, scientist have become interested in using 'sparsity-promoting' penalizations 
for the solution of linear ill-posed inverse problems. In particular, the following minimiza- 
tion problem involving an ^-penalized least squares functional has attracted a great deal 
of attention: 

x(X) = argmin \\Kx — y\\ 2 + 2A||:c||i (3) 

for a given real matrix K, real data y and positive penalization parameter A > 0. The 
^i-norm is defined as ||x||i = J2i \ x i\- 

The minimization problem ([3]) is commonly referred to as 'lasso' after [3], although 
earlier uses of this technique do exist [5]. In the context of signal decomposition, the term 
'basis pursuit denoising' is often used [6]. 

Compressed sensing [8], a rapidly developing research area, is based on the re- 
alization that a signal of length N can often be fully recovered by far fewer than N 
measurements if it is known in advance that the signal is sparse. The recovery is done 
by calculating arg ima.K X =y IM|i> or i n case of the presence of noise by calculating the 
minimizer ([3]) for which it is known that x(X) is typically sparse, i.e. a great number of 
its components are exactly zero (at least for a large enough penalty A). 

Unfortunately, the variational equations that describe the minimizer ([3]) are nonlinear 
(see equations ([7]) below). The use of a traditional £2 penalization leads to linear equations 
but a non-sparse minimizer. Hence the question of finding ways of efficiently recovering 
the minimizer of functional ([3]) has attracted a great deal of interest [9l [TUl [HI Q2J Q3] . 

The Mathematica package LlPackv2 implements a couple of algorithms for finding the 
minimizer ([3]). There are five iterative algorithms which yield an approximation of x(X) 
when stopped after a finite number of steps. The principal algorithm however can calculate 
the minimizer x(X) exactly (up to computer round-off) in a finite number of steps. This 
algorithm is known as the 'homotopy method' [14] or as Least Angle Regression (LARS) 

m- 

The algorithms in this package can also treat the minimization of a slightly more 
general functional: 

minllKx — y\\ 2 + 2\*S^ wAxA (4) 

x z — < 

i 

with positive (or zero) weights W{ > 0. 

The procedures are all written with real matrices and signals in mind. They do not 
work for complex variables. Extension to complex data would require optimization over 
quadratic cones; the thresholded Landweber algorithm could be easily adapted, but not 
the other algorithms. The package was developed and tested with Mathematica 5.2 [16] . 
Other toolboxes geared towards the recovery of sparse solutions of linear equations exist 
for Matlab: SparseLab [17j . £i-magic [15] . 

2 Implementation 

In order to describe the solution of the minimization problem ([3]), it is worthwhile men- 
tioning the following four equivalent problems: 

1. the minimization of the penalized least squares functional: 

x(X) = argmin \\Kx — y\\ 2 + 2A||a?||i (5) 



2 



2. the constrained minimization problem: 



x(R) = arg min \\Kx — y\\ (6) 

||x||i<i?, 

(with an implicit relation between R and A, see below) 

3. the solution of the variational equations: 

{K T (y - Kx))i = A sign(xj) if / , , 

K^^-ifx)),! < A if xi = {<) 

4. the solution of the fixed-point equation: 

x = S x [x + K T (y-Kx)}, (8) 
where S\ is soft-thresholding applied component-wise (see expression (USD). 
One has R = ||x(A)||i and also: 

A = \\K T (y - Kx(X))\\ 00 = m a x\(K T (y - Kx(X)))i\ (9) 

i 

which follows immediately from relations ©. 

The equivalence of the four formulations dSHS]) can easily be seen as follows: Con- 
strained minimization ([6]) is turned into problem ([5]) by the introduction of a Lagrange 
multiplier A. The variational equations ([7]) are derived from ([5]) by putting x = x + hei 
(where h is a small parameter and is the ith. basis vector) and expressing that \\Kx — 
y\\ 2 + 2X\\x\\i < \\Kx — y\\ 2 + 2A|jx||i for all h. The fixed-point equation is derived from 
equation ([7]) by adding Xi to both sides of the equalities, by rearranging terms and using 
S\(u) = u — Asgn(ii) for |n| > A and S\(u) = for \u\ < A. 

It can be shown that x(X) is a piecewise linear function of A and that x(X) = for 
A > A max = maxj \ {K T y)i\. Hence, it is sufficient to calculate the nodes of x(X); for generic 
values of A in between these nodes, one can use simple linear interpolation to determine 
x(X). 

The nodes of x(X) can be calculated using only a finite number of elementary operations 
(addition, multiplication, ...) using the LARS/homotopy method p~H [15]. In other 
words, despite the variational equations being nonlinear, they can still be solved exactly! 
In fact, the variational equations ([7]) are piece-wise linear and can be solved by starting 
from x(A = A max ) = and letting A decrease: at each stage where a component of x(X) 
becomes nonzero (or, in some exceptional cases, where it becomes zero again) a 'small' 
linear system has to be solved to ensure relations d7J); this procedure is stopped when the 
desired value of A is reached (or when some other stopping criterium such as e.g. ||x||i = R 
or |supp (x)\ = N, . . . is satisfied). 

Thus, the computational burden of such an algorithm consists of two contributions at 
every node: calculation of the remainder K T (y — Kx), and the solution of a linear system 
of size equal to the support size of that minimizer. The first contribution is the same at 
every step, the second is very small at the start (support size starts at 0) but dominates 
after a certain number of steps. 

Such an algorithm is implemented in this Mathematica package under the name FindMinimizer. 
If K and y are given in terms of exact numbers (integer, rational, . . . ), this algorithm will 



3 



yield an exact answer. If approximate numbers are used as input, then the algorithm will 
output a numerical solution. 

Other implementations exist (e.g. in Matlab \17\ [T9] or in R |15j). but they only work 
with approximate, floating-point, numbers. Being able to work with exact arithmetic is 
advantageous for pedagogical reasons: One can easily follow how the solution is found (the 
implementation can also print out intermediate results), at least for small matrices. 

Another advantage of this implementation is that it is able to handle an exceptional 
case that is left unaddressed in [151 El IE!- Indeed, in [15] it is mentioned that their 
implementation of the algorithm cannot handle the case where two new indices may en- 
ter the support of x at the same time. Using floating-point data, this is indeed a rare 
occurrence, but using small integer matrices and data, it is easy to find such exceptions 
(see several fully worked-out examples in section [5]) . Section [5] describes how the present 
implementation deals with that case. 

Probably the biggest advantage over existing software is that LlPackv2 can handle the 
problem of minimization of the functional 

x{\) = argmin \\Kx — y\\ 2 + 2A^ u>i|»i| (10) 

i 

with weights Wi > 0. If all weights are nonzero, then this can be reduced to the problem 
([5]) by a simple rescaling of the independent variables. However, if zero weights are present, 
this is no longer true. Still, the implementation provided by LlPackv2 handles that case 
(\15 \ \17 \ 119] do not). The variational equations now are: 

{K T (y-Kx))i = Wi A sign(xj) if Xi / , , 

\(K T (y-Kx))i\ < Wi X if x i = 

In case of the presence of unpenalized components (wi = for some i), the biggest differ- 
ence with the previous case is that the starting point is no longer the origin (and A max is 
no longer maxj \{K T y)i\). Also, for a given minimizer x(X), the penalty parameter equals 

A = max \(K T (y - Kx))i/w % \ (12) 

(compare with expression ©). Other than that, the algorithm (broadly speaking) follows 
the same lines. 

In addition to the LARS/homotopy method used in FindMinimizer, the package also 
provides five iterative algorithms for the minimization problems ([5]) and (|6|): 

1. ThresholdedLandweber 

2. ProjectedLandweber 

3. ProjectedSteepestDescent 

4. AdaptiveLandweber 

5. AdaptiveSteepestDescent 

After a finite number of steps, these yield an approximation of the minimizer. Whereas 
the LARS/homotopy method heavily is derived from the variational equations ([7]), the 
'thresholded Landweber' method (see formula (|15p ) is inspired by the fixed-point equation 
(JH]). The formulation ([6]) lays at the basis of the remaining four algorithms 2-5, see 
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formulas (|16p - (|19p . They use projection on an £i-ball, which can be calculated fairly 
easily in practice [10] . Two of these four algorithms use a path- following strategy (where 
the minimizer x(R) is sought for a whole range < R < R ma , x , just as the LARS/homotopy 
method), whereas the other two only calculate the minimizer x(R) for a specific value of 
R (just as the 'thresholded Landweber' algorithm calculates x(X) for a specific value of 
A)- 

All algorithms have an option to generate a list of intermediate results at each step (i.e. 
each node for the exact method, and each iteration for the iterative methods). One can 
use the following variables: iterate or node number, time elapsed since start, x, y — Kx, 
K T {y — Kx) (without having to make additional matrix multiplications!) and any function 
of these. This is useful for evaluating the behavior of the algorithms, or e.g. for easily 
being able to plot the trade-off curve, (||x(A)||i vs. \\Kx(X) — y\\ 2 ) as a function of A. An 
example is given in section [5J 



3 Main algorithm 

The main algorithm is called FindMinimizer and has syntax 

FindMinimizer [K_, y_, options ]. 

It calculates the minimizer x(X) of (|3|) with the 'exact' algorithm (i.e. homotopy method/LARS 
[14] 115]). It takes the operator K and the data y as input. The option Weights — > w can 
be used to set the weights Wi as in the formulation (110p . The default for this option is all 
weights equal to 1: Wi = 1. 

This algorithm starts from x = (assuming no zero weights) and lets ||x||i grow, and 
A descend until a stopping condition is met. There are four different stopping criteria 
possible: 

1. The option StoppingPenalty— > A will stop the algorithm when the penalty param- 
eter reaches this value. In other words, 

FindMinimizer [K , y , StoppingPenalty^ A] 

will calculate x(\) as in (|5|). The default is StoppingPenalty— > 0. 

2. The option MaximumLlNorm->i? will stop the algorithm when the ^i-norm of the mini- 
mizer reaches the value R. In other words, the command FindMinimizer \_K , y , MaximumLlNorm 
R] will calculate x(R) as in ([6]). The default is MaximumLlNorm— > oo which means 

that the default will never cause the algorithm to stop. 

3. The option MinimumDiscrepancy->d will make the algorithm stop when ||i^x— y\\ 2 = 
d. The default is MinimumDiscrepancy->0. 

4. The option MaximumNonZero^ will stop at the first node of x, having N nonzero 
coefficients. The default is MaximumNonZero— > oo. Unpenalized components (i for 
which Wi = 0) are always included in this count. 

The use of StoppingPenalty, MinimumDiscrepancy or MaximumLlNorm allows the user 
to stop the algorithm at exactly that value of A, \\Kx — y\\ 2 or ||x(A)||i respectively: the 
first node that overshoots this condition is computed, but linear interpolation with the 
previous node is used to arrive at exactly the value specified. This is very useful if you 
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Name 



Meaning 



Minimizer 

Counter 

DataMisf it 

Remainder 

Penalty 

Support 

Time 



x(\) 



0) 



index of the node (if ^ 0, the zeroth node is x 
y — Kx 
K T {y - Kx) 
A 

the support of x 

elapsed time since start (in seconds, but no 'Seconds' included) 



Table 1: A list of the variables that can be used in the ListFunction and 
StoppingCondition options of the FindMinimizer function. 



only need to find the minimizer x(X) for one specific value of A, \\Kx — y|| 2 or ||x||i. There 
is no need to compile the list of intermediate nodes and no need for doing the interpolation 
manually. 

Apart form the four specific stopping criteria above, one can also use a more general 
criterion: StoppingCondition— >cond. E.g. StoppingCondition :>(Time>=3) will make 
sure the algorithm stops at the first node that is calculated within 3 seconds. All the 
variables listed in table Q] can be used to express a stopping condition. The default is 
False meaning that it will not cause the algorithm to stop. 

Notice that the default behavior will make the algorithm stop at the last node, for 
which A = and K T (y — Kx) = 0. When using floating point data one should take care 
to specify a good stopping condition. Using floating point numbers, the default stopping 
condition, StoppingPenalty^ 0, most likely will not work (because maxj \(K T (y — Kx))i\ 
will eventually be ~ 10~ 16 or so, which is still strictly larger than 0); In this case, it is 
best to specify an explicit stopping condition of type StoppingPenalty^ A with A > 0. 
One could also use MaximumLlNorm— > R if one has a good idea for a suitable final ||x||i. 

The output of the FindMinimizer algorithm is of the form {{collected data}, a;}, or 
just x if no data are collected (default). The data that are to be collected at each node 
are described by the ListFunction option. Table Q] lists the variables that can be used. 
E.g. 

FindMinimizer IK , y , 

ListFunction : >{Norm [Minimzer , 1] , Norm [DataMisf it] 2 }] 

will make a list containing {|jx||i, \\Kx — y\\ 2 } at each node. This list could then be used 
to plot the trade-off curve (see section [5] for an example) . 

Finally, there is an option Verbose — > n that controls the amount of information that 
is printed while running. Verbose — > corresponds to no information at all. 

The FindMinimizer function temporarily switches off Mathematica division-by-zero 
warnings while it is running. 

The implementation in SparseLab (called SolveLasso) [17] can also calculate the min- 
imizers with only positive components. This is not possible in LlPackv2. 

4 Other algorithms and utilities 

Firstly, the following auxiliary functions are available. 
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• Sof t Threshold [x_, A_] : 

Soft-thresholding operation of x with threshold A > 0. It is defined as: 

{x — A x > A 

|x| < A (13) 

x + A x < —A 

If a; is a vector, S\ is applied component-wise. Here A can also be a vector. 

• ProjectionOnLlBall [x_, R_] : 

Projection of a; on a t\ ball of radius R. The projection Pr{x) of x on the ^i-ball 
with radius R is defined as: 

Pr{ x ) = ar g m i n 11^ ~~ u \\ 2 (14) 

||m||i=-R 

One can show that Pr{x) = S T (x) where r has to be chosen as a function of R and 
x [TO]. 

• MinimizerInterpolationFunction[minlist_,lambdalist_,var_] takes as input a 
list of nodes of x, and the corresponding value of the parameter A and returns x(X) 
evaluated at var (can be symbolic). S(A) is a piece- wise linear function of A, so the 
result of this procedure is a list of InterpolatingFunctions evaluated at var. The 
input range of the function is between the smallest and the largest A in lambdalist 

• CheckMinimizerList [K_,y_,minlist_, lambdalistj checks whether the points in 
minlist are consecutive nodes of the minimizer ([5]) (corresponding to A in lambdalist). 
I.e. it checks the fixed point equation (JSj) between these nodes. It is rather slow be- 
cause it uses Simplify and has to recompute the remainders. The result is True if 
the test is successful, False if the test fails and Indeterminate if the input is not 
exact (i.e. floating point), not of equal length or not sorted in order of descending 

A. The only options are Weights and Verbose. 

In addition to the FindMinimizer algorithm that uses the LARS/homotopy method 
to calculate the minimizer ([3]), there are also five iterative algorithms available. They can 
be used to find an approximation of the minimizer ([5]) or ([6]) . The first three calculate the 
minimizer for one specific value of the penalty parameter A in ([5]) or one specific value of 
R in El 

• ThresholdedLandweber [K_,y_, lambda_, options ] implements the thresholded 

Landweber algorithm [S]: 

x (n+l) = Sx { x (n) + x T (y - KxW)] (15) 

Options are listed in table [2j The main advantage of this algorithm is that it is very 
simple. Convergence can be (very) slow, especially for small values of the penalty 
parameter A. 

• ProjectedLandweber [K_,y_,R-options ] implements the Projected Landweber 

algorithm |10j : 

X ( n +V =p R [ x (n) + K T {y-Kx {n) )} (16) 

Options are listed in table [21 This algorithm uses the ^i-norm of the minimizer as 
parameter instead of the parameter A. The nonlinear operator Pr is also slower than 
the soft-thresholding operator S\. 
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option default description 

StartingPoint {0, . . . , 0} specify the zeroth point in the iteration 

Weights {1, 1, . . . , 1} similar as for FindMinimizer 

ListFunction Null similar as for FindMinimizer 

StoppingCondition Counter>=l specify when the algorithm should stop. All. 

variables that can be used in ListFunction can 
also be used here. E.g.: StoppingCondition -> 
(Counter>=50 | | Time>=10) 

Table 2: Options available for ThresholdedLandweber, ProjectedLandweber, 
ProjectedSteepestDescent. Notice that the default behavior will only make one it- 
eration. 

• ProjectedSteepestDescent [K_,y_,R_, options ] implements 

x (n+l) = p R ^(n) + {n) R T ^ y _ Kx (n)y (17) 

with r W = K T (y - Kx^) and (3^ = \\r^ \\ 2 /\\Kr^ || 2 [10]. Options are listed in 
table EJ The main advantage of this algorithm is that, due to the variable step size 
faster convergence may be achieved |10j . 

The last two algorithms try to use an (approximate) path following strategy by grad- 
ually increasing the radius R of the £i-ball on which the minimizer is sought. This can 
also be interpreted as a 'warm start strategy' in which an (approximate) minimizer (for 
one value of ||cc||i) is used as the starting point for obtaining the minimizer with a larger 
value of ||sc||i. 

• Adapt iveLandweber [K_,y_,R_,numsteps_, options ] 

x (n+l) = p Rn [ x in) + R T^ y _ Kx (n)^ x (0) = q (lg) 

with R n = (n + l)i?/numsteps [10J. 

• Adapt iveSteepestDescent [K_,y_,R_,numsteps_, options ] 

x (n+l) = p Rn [ x i.n) + p(n) R T ^ _ Rx (n)^ x (0) = q (lg) 

withrW = K T (y-Kx^), (3^ = \\r^\\ 2 /\\Kr^\\ 2 and with Rn = (n+l)-R/numsteps 

For Adapt iveLandweber and AdaptiveSteepestDescent the starting point is always zero. 

One may also use linear operators instead of explicit matrices: The following functions 
need to be overloaded: Dot, Part, Dimensions, Length. This is not part of LlPackv2. 

The algorithms in this section are not available in SparseLab [T7], but the Matlab 
package also has a number of algorithms that use a path-following strategy. It also includes 
a number of methods geared towards the problem of finding argmin^ x= j / ||x||i or even 
arg minify ||x|| . 

5 Examples 

In this section a number of elementary but instructive examples are presented to clarify 
the FindMinimizer algorithm, to illustrate its possibilities, and to make the reader aware 
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of some special cases. A practical application of sparse regression is also included, as well 
as an indication of the problem sizes it is suitable for. 

Roughly speaking (there are exceptional cases), when the weights W{ all equal 1, the 
FindMinimizer algorithm performs the following actions at each step (in accordance with 
equation (J7|)): 

• calculate the remainder r = K T (y — Kx). 

• determine the components of r that are maximal (in absolute value). These will be 
the nonzero components in x. 

• calculate a walking direction v such that x + fiv will lead to a uniform decrease (i.e. 
equal across the different active components) of the maximal value of r (this involves 
solving a linear system). 

• walk as far as possible (choice of fj,) until another component of the remainder 
becomes equal in size to the maximal ones, or until one of the nonzero components 
of x becomes zero. 

Generically, only one additional component will enter the support of x at any given 
step. In the special case that the set argmaxj |rj| \ supp(x) has more than one element, 
i.e. there is more than one candidate new index, one needs to look more closely at the 
walking direction v to determine the right components to add to the support of x. The 
two conditions that have to be satisfied are: (1) the sign of the components (in the support 
of x) of v correspond to the sign of the components of r (so that r and x will have the 
same sign in that component, as prescribed by the equalities in ([7])); (2) the components of 
K T Kv that are not in the support of x are at least as large as those who are in the support 
(so that the corresponding components of r remain smaller or equal to the components 
that are in the support of x, as prescribed by the inequalities in (J7|). 

Each line in the following tables corresponds to a node in the minimizer x(A). For 
each node, some key information is given: the minimizer x(A), the ^i-norm ||x(A)||i, the 
penalty A, the remainder K T (y — Kx), the support of x and the size of the support of 
x(A). This will allow the reader (in every example) to check the accuracy of the result 
only by checking signs and comparing absolute values of components of the remainder (via 
equations (j7|). 

To save space, vectors are written in row form, and rows of the same matrix are 
separated by a semicolon. 

The first example is elementary. This can easily be done by hand. The operator K is 
the identity (in a five dimensional space) and the data is: y = (12, —8,5, 1,2). The list of 
nodes of the minimizer x(A) looks as follows: 



Node 


x(A) 


X ||i 


A 


K T {y - Kx) 


supp(x) 


supp(x) 





(0,0,0,0,0) 





12 


(12,-8,5,1,2) 


{} 





1 


(4,0,0,0,0) 


4 


8 


(8,-8,5,1,2) 


{1} 


1 


2 


(7,-3,0,0,0) 


10 


5 


(5,-5,5,1,2) 


{1,2} 


2 


3 


(10,-6,3,0,0) 


19 


2 


(2,-2,2,1,2) 


{1,2,3} 


3 


4 


(11,-7,4,0,1) 


23 


1 


(1,-1,1,1,1) 


{1,2,3,5} 


4 


5 


(12,-8,5,1,2) 


28 





(0,0,0,0,0) 


{1,2,3,4,5} 


5 



If one understands this example, one understands the whole algorithm. 
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At step zero, the remainder is maximal at component number 1. This will be the first 
nonzero component in x. The value 12 is the value of A at this node. Now we walk in the 
direction of the first unit vector until another component of the remainder becomes equal 
(in absolute value) to the first component. This happens when x\ = 4 and it defines the 
first node. The maximal absolute value of the components of the remainder is again the 
penalty A in this node (i.e. 8). Now we walk in the direction (1,-1,0,0,0), because it 
will make the first two components of the remainder decrease equally (in accordance with 
equation (J7|)). We stop at the point (7,-3,0,0,0) because the third component of the 
remainder is then equal to the maximal ones. We continue in this way until the maximal 
value of the remainder is zero. 

For K not equal to the identity, linear equations have to be solved at each step, and 
this is tedious by hand. 

The next example illustrates the fact that when the initial remainder has two compo- 
nents of equal (maximum) size (first and third in this case), it does not necessarily follow 
that the two corresponding components of x(A) become nonzero. Here, only the third 
component of x becomes nonzero in the first step. We have to wait until the third step 
to get a nonzero first component. The matrix is K = (—3, 4, 4; —5, 1, 4; 5, 1, —4) and the 
data is y = (24, 17, —7). The list of nodes is as follows: 



Node x(X) 



A 



K T (y - Kx) 



supp(x) |supp(x) 




1 

2 
3 
4 
5 



(0,0,0) 



0,0 



43 
' 16, 



192,106,192) 
63, 63] 



n 43 43 

, U > 1 fi 5 IK 



LIZ ML n 4 f o ZOO I ZOO ZOO ZOO 

70 1 TO 1 " I TO 79 1 70 1 70 1 70 



( 



15' 15 
172 301 
73 ' 73 
2356 4251 n 
991 ' 991 ' u 

4,5,-2) 




{} 
{3} 
{2,3} 
{1,2} 

{1,2} 
{1,2,3} 




1 

2 
2 
2 
3 



A similar phenomenon may occur anywhere in the calculation (any step), when the two 
or more new components of the remainder become equal to the maximum value of the 
remainder. 

In fact, choosing which components are the right ones to enter the support of x(\) is 
the hard part of the algorithm. Indeed, if one knows the support of x(\), one only needs 
to solve the linear system in ([7|) to find x. 

When using floating point arithmetic and real data, one is unlikely to encounter such 
cases. 

Below is the output generated, for the same operator K and data y, by the Matlab 
implementation [19]. Clearly, the first node is wrong, because the largest component of 
the remainder (third) does not correspond to the nonzero component in x (first). These 
points do not satisfy the fixed point equation ([8]); they are not minimizers. The correct 
minimizers were given above. 



Node 


X 


K T (y - Kx) 





(0,0,0) 


(-192.0000, 106.0000, 192.0000) 


1 


(-1.8298,0,0) 


(-84.0426, 84.0426, 96.8511) 


2 


(-1.0293,0,1.4009) 


(-58.4255, 71.2340, 71.2340) 


3 


(-2.1807,0.6141,0) 


(-55.9691, 68.7776, 68.7776) 


4 


(-2.5971,3.8760,0) 


(7.7421,5.0664,-5.0664) 


5 


(-10.3550,7.6758,-9.7765) 


(2.6758,-0.0000,0.0000) 
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These numbers were calculated with the Matlab command 



Pill 


A 


K T {y - Kx) 


supp(x) 


|supp(x)| 





112 


(-112,105,56) 


{} 





7 
1 


217 
4 


( 217 217 175 N 
\ 4 ' 4 ' 4 , 


1 {1} 


1 


7 
3 


133 

3 


/ 133 133 112' 
\ 3 ' 3 ' 3 , 


I {2} 


1 


49 
18 


308 
9 


/ 595 308 308^ 
V 18 ' 9 ' 9 , 


I {2} 


1 


49 
9 


19 

9 


/ 49 49 49 A 
V 9 ' 9 ' 9 ) 


{2,3} 


2 


6 





(0,0,0) 


{1,2,3} 


3 



lars (mat , data , 'lasso' ,0,0) 

where mat and data correspond to K and y with a final all-zero line appended. 

The problem stems from the fact that the first remainder has two equal largest compo- 
nents. The Matlab implementation does not handle this case. Unfortunately, the Matlab 
script does not give any warning. 

The SparseLab [T7] implementation SolveLasso also gives a wrong answer (without 
warning) in this case. The first breakpoint returned by SolveLasso is x = (5.3750, 0, 9.4062); 
the corresponding value for K T (y—Kx) is (—20, 20, 20). The signs of the first component 
of x and K T (y — Kx) are unequal, in violation of equation ([7]). 

The third example shows that a nonzero component of x may be set to zero again at 
smaller values of A: K = (-4, 3, -1; -4, 4, 3; -1, 1, -1) and y = (7, 21, 0). 

Node x(\) 
~0 (0,0,0) 

1 f-I,o,o 

2 (0,I,0) 

3 (0,§,0 

4 (0 25 1 

v ' 9 ' 3 

5 (-1,2,3) 

It follows that the number of nodes and the size of the support do not grow at the same 
rate. In figure [3] there is a numerical example that illustrates this phenomenon more 
clearly. 

The final example illustrates the uses of different weights for the different terms in the 
penalty as in expression (jlOp . The matrix and data are the same as in the second example 
and the weights are w = (2, 1, 0). The list of nodes now looks like: 

Node x{\) A K T {y — Kx) supp(x) |supp(x)| 

"0" (0,0, -i) I f W i 

1 (0,-^f) S S («>-8>0) {2,3} 2 

2 (3,-4,1) 8 (0,0,0) {1,2,3} 3 

When there is a zero weight (as is here the case for the third component of x), the starting 
point is no longer necessarily the origin: the component with zero weight is nonzero. The 
corresponding component of the remainder must equal zero at every step. To determine 
the penalization parameter A we have to look at the remainder divided by the weights 
(component by component division, and excluding the components with zero weights). In 
(llr> — TF'Oj /(2, 1,0), the second component is the largest, and this one will enter the 
support of x in the next step. 

These tables were, of course, generated automatically with the help of FindMinimizer 
function and the built-in Mathematica TeXForm and MatrixForm commands. 

The remaining part of this section is devoted to a small sparse regression problem. 

We prepare a random matrix K of size 30 by 100 with elements taken from a normal 
distribution. We also prepare a sparse input model x; n with 10 nonzero components (see 
figured] left). We calculate data y as y = Kx m + e where e is a noise vector with elements 
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Figure 1: Left: The (sparse) input model x- m . Middle: the (sparse) ^-penalized recon- 
struction x^ 1 ). Right: The ^-penalized reconstruction x^ 2 ' . 

taken from a normal distribution. The norm of the noise vector is 3% of the norm of the 
noiseless data: ||e|| = 0.03||Axi n ||. In other words, we construct an underdetermined toy 
regression problem with 100 unknowns, 30 noisy data, and 10 nonzero coefficients in the 
sought after solution. 

We compare two reconstruction methods: the t\ penalized method (|3|) and the £2 
penalized method ([1]). In both cases the penalty parameter is chosen in such a way that 
the discrepancy of the solution equals the noise level: \\Kx — y|| 2 = ||e|| 2 . 

For the ^-penalized method we can simply use the FindMinimizer algorithm: FindMinimizer \_K , y , Minimu 
||e|| 2 ] . For the ^-penalized reconstruction we solve the linear system ([2]), where A is chosen 
manually in such a way as to have \\Kx — y\\ 2 = ||e|| 2 . 

These two reconstructions, x^ and x^ 2 \ are thus equivalent from the point of view 
of the data: they fit the data equally well, and no better than the noise level justifies. 
However if we compare these two reconstruction with the sparse input model Xj n , we can 
see that the £\ method clearly succeeds in producing a better reconstruction than the 
£2 method. This toy example thus demonstrates the so-called compressed sensing idea, 
whereby a signal can be faithfully reconstructed from few measurements provided one 
knows that the signal is sparse [8] . 

Figure [2] (left) illustrates the trade-off curve for this problem. The solid line represents 
the curve (||x(A)||i, ||Ax(A) — y\\ 2 ), for varying A. The points on the trade-off curve can 
easily be calculated with the single command: 
FindMinimizer [K ,y ,MinimumDiscrepancy— > ||e|| 2 , 
ListFunction: >{Norm[Minimizer , 1] ,Norm[DataMisfit] 2 }] . 

For comparison, the approximate trade-off curve traced out by the adaptive Landweber 
algorithm (|18p . 10 steps, is also pictured. Figure [2] (right) pictures the reconstruction error 
||x — x( n )||/||x|| as a function of the iteration step n. Clearly, the iterative algorithms do 
not perform very well, and the exact (i.e. L ARS/homotopy) method is to be preferred for 
this type of problem. 

Finally, we make a numerical test of the FindMinimizer algorithm on a random 700 x 
900 matrix. For large numerical matrices it is best to always provide an explicit stopping 
condition (most likely the user is not interested in the default: limA^o^(A)). We keep 
track of some intermediate information (node number, time, support size and error) that 
is used for plotting in figure [3l More than 1400 nodes take just under 8 minutes (on a 
2GHz workstation). 

numdata = 700; numvars = 900; 

mat=Table [Random [Real , {-3 , 3}] , {numdata} , {numvars}] ; 
data=Table [Random [Real , {-3 , 3}] , {numdata}] ; 
sol=FindMinimizer [mat , data, ListFunction :> {Counter, Time, 
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Figure 2: Left: Trade-off curve (solid line) for the toy regression problem. Horizontal axis is 
|| a? ||i, vertical axis is — y|| 2 . Diamonds indicate the nodes of x(A). The solid line is the 
trade-off curve (it passes through the diamonds). The stars represent the path followed 
by the Adapt iveLandweber algorithm Q18|) (10 steps). Right: Convergence properties 
of different algorithms. Vertical axis represents the distance to the true minimizer (i.e. 

_ jc||/||a?||). Horizontal axis is the iteration number (n). Diamonds represent the 
nodes of the exact algorithm, triangles are for the thresholded Landweber algorithm (|15p 
and stars are for the adaptive Landweber algorithm 1181 

Total [Abs [Sign [Minimizer] ] ] , Norm [Minimizer - 

Sof tThreshold [Minimizer+Remainder , Penalty] ] /Norm [Minimizer] }] ; 

These results were plotted in figure EJ On the left we see that the number of nodes 
and the number of nonzero components grows equally fast at first. Later on however, as 
the algorithm progresses, the support size does not grow with each node anymore (some 
nonzero coefficients are set to zero again). On the right we see that the relative error 
\\x — S\(x + K T (y — Kx))\\/\\x\\ (cfr. formula (}8|)) stays bounded, even after more than 
one thousand nodes. This shows that the FindMinimizer algorithm is very well suited to 
handle problems for which the sought after minimizer has fewer than about 1000 nonzero 
components. 

6 Summary 

A Mathematica package for the minimization of l\ penalized functionals was described. 

The FindMinimizer algorithm finds the exact minimizer of such a functional, as long 
as exact input is given. For floating point data, the same algorithm can be used to obtain 
a (very good) numerical approximation of the minimizer. 

The FindMinimizer algorithm also handles the case of different penalties for different 
components: XJ2i w i\ x i\ instead of A||x||i, including the case where some of the Wi are 
zero. 

In [20] , a wavelet basis is used together with the iterative soft-thresholding algorithm (|15j) 
to perform denoising. A wavelet decomposition consists of wavelet coefficients (encoding 
local detail) and scaling coefficients (encoding large scale averages). This is an example 
where it makes very good sense to penalize different components differently or not to 
penalize (at all) the scaling coefficients. A geophysical example of the use of such a 
technique in seismology can be found in |21| . 

In many ways, this FindMinimizer algorithm is an improvement over the Matlab 
implementation |191I17|; the latter cannot handle exact numbers (integer, rational), cannot 
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Figure 3: Left: The node number as a function of the support size of the minimizer at 
that node. At first these increase equally (straight line), but, as the remainder approaches 
zero, the number of nodes increases faster than the number of nonzero components in the 
minimizer. At the end there are more than twice as many nodes as there are nonzero 
components in x. Right: The relative error \\x — S\(x + K T (y — Kx))\\/\\x\\ as a function 
of the node number for the same example. Ideally this should be zero as the minimizers 
are supposed to satisfy the fixed point equation ([8]). Due to numerical round-off it is not 
exactly zero. The matrix K and data y were chosen at random (see text). 

handle zero weights, and sometimes does not give the correct answer (see the fourth 
example in section [5]). The present algorithm can also compile a list of intermediate 
results on the fly. The intermediate value of x, y — Kx and K T (y — Kx) have to be 
computed anyway; if one would do this after finishing the algorithm (based on a list of the 
nodes), one has to perform a great number of matrix multiplications all over again thereby 
doubling the computing time. Also, it might be impossible to store all the intermediate 
minimizer nodes, rendering the latter strategy impossible in certain cases. Again, this is 
not possible with |17t QjJ] (it only outputs a list of all the intermediate minimizers) . 
Also with regard to the possible stopping conditions, the present algorithm has more 
possibilities: [19] can stop at a predetermined number of nonzero components, or at a 
predetermined 1-norm of the minimizer. SolveLasso in SparseLab [17] can stop at the 
first breakpoint with penalty parameter or discrepancy below a predefined level (though 
no interpolation with the previous breakpoint is done to arrive at the exact value) or after 
a given number of steps. FindMinimizer has a generic stopping condition but can also 
stop at a predetermined value of the penalization parameter A, of ||x||i or at a specific 
value of the discrepancy \\Kx — y|| 2 exactly. The latter is very useful in practice when 
using the so-called discrepancy principle [22]: the minimizer should not fit the data better 
than the level of the measurement noise. 

The FindMinimizer algorithm was tested on hundreds of thousands of randomly gen- 
erated matrices and data (both square and rectangular). The results were verified and 
no wrong outcomes were discovered. This is no guarantee that there aren't any bugs left. 
Hence, it is recommended that users would check the results returned by the algorithms 
by checking equations (|7|) or (fTT|) , 

The FindMinimizer function does not give a warning when the solution is not unique. 
If using floating point numbers and a warning about ill-conditioned matrix is generated, 
this may mean that the minimizer is not unique. 

The worst case scenario for the FindMinimizer procedure happens when the first 
remainder K T y has all components of the same size. In this case, the algorithm has 
to resort to trying all possibilities to determine the new index/indices (there are 2 n — 1 
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nonempty subset of {1,2,..., n}). Such a special case 'never' happens in practice, but it 
is easy to manually construct such examples. 

Five other useful algorithms, like the iterative soft-thresholding algorithm (|15p and 
algorithms that use projection on an ^i-ball (see expressions (fT6HT9j) ) are also implemented. 
They also support the use of (zero and nonzero) weights. In general, after a finite number 
of steps, these only yield an approximative minimizer. 

For clarity, only examples with very small size matrices were given in this manuscript. 
However, the package was also tested on the geo-physics application discussed in [21] . 
and FindMinimizer succeeded in performing that minimization in less than one minute. 
The package was also used to evaluate the performance of competing iterative methods 
(other than the ones also discussed here) [23]. The comparison of the relative strength 
and weaknesses of these different algorithms lies beyond the scope of this work. 
Quite a number of other methods for finding the minimizer ([3]) are available in Sparse- 
Lab [TTj. Some are designed to find the related argmxaK X=y \\x\\i instead of ([3j). The 
^i-magic package [18] also provides a number of primal-dual interior point methods for 
solving this and related problems in Matlab; it does not have an implementation for the 
Lasso/homotopy method. 

Future work consists of extending the exact algorithm to the case of linearly constrained 
minimization: 

min \\Kx — y\\ 2 + 2A||x||i with Ax = a, (20) 

i.e. minimization of the same l\ penalized least squares functional as before, but now under 
additional linear constraint(s) described by the linear equation(s) Ax = a. Although ap- 
proximative methods do exist for this problem, a L ARS/homotopy-style (exact) method 
for constrained minimization was not described in |14|. 115] . It is possible to solve this 
problem exactly and a prototype of such an exact algorithm, based on the FindMinimizer 
algorithm, has already been implemented and used in an application of portfolio construc- 
tion (in finance) [24j . 
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